View of Hoover Dam. Source: Wikimedia Commons

Does More “Green” Mean More Green Energy?

An investigation into whether a country’s hydroelectricity output is positively and directly related to GDP.

Chris Chae, Martin Hsu, Gillian Ippoliti, and Sydney Oberg

STAT 331, California Polytechnic State University



Background

Do richer countries produce more hydroelectric power? Hydroelectric power is harnessed by using running water to turn a turbine, which then drives a generator that converts the motion into electricity. Water has been used as a power source for millennia, but as technology has advanced, more sophisticated hydroelectric power systems have been adopted as a more sustainable alternative to fossil fuels. As with many sustainable advancements, the cost of switching from fossil fuels to hydroelectric power can be a significant obstacle. This leads to the question of how a country’s GDP can influence its utilization of hydroelectric power. Overall, have wealthier countries been quicker to utilize this sustainable power source? Our team investigated the connection between GDP and hydroelectric power using data from GapMinder, a global trends data source, from 1959 to 2010.

Obtaining the Data

To investigate this question, our team pulled and cleaned two datasets from gapminder.org: one on hydroelectricity production per person and another on GDP per capita, each by country and year.
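Combining two such tables by country and year can be sketched as follows. This is a minimal Python illustration, not the team's actual code: the records are made-up toy values, the helper `inner_join` is a hypothetical name, and the choice to keep only country-year pairs present in both tables is an assumption. The column names `hydro_person` and `gdp_person` follow the model output tables later in this report.

```python
# Hypothetical toy records; the values are invented for illustration only.
hydro_rows = [
    {"country": "A", "year": 1960, "hydro_person": 0.10},
    {"country": "A", "year": 1961, "hydro_person": 0.12},
    {"country": "B", "year": 1960, "hydro_person": 0.01},
]
gdp_rows = [
    {"country": "A", "year": 1960, "gdp_person": 9000.0},
    {"country": "B", "year": 1960, "gdp_person": 1500.0},
    {"country": "B", "year": 1961, "gdp_person": 1600.0},
]


def inner_join(left, right, keys=("country", "year")):
    """Keep only rows whose (country, year) key appears in both tables."""
    # Index the right-hand table by its key for constant-time lookups.
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined


combined = inner_join(hydro_rows, gdp_rows)
for row in combined:
    print(row)
```

Only the two country-year pairs found in both toy tables survive the join, so each remaining row carries both a hydroelectricity value and a GDP value.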

We combined the datasets and created a linear model comparing the hydroelectricity production to GDP. The final dataset observations can be previewed below.

The following variables are included:

Visualizing Linear Models

Below is a visualization of the data that plots hydroelectric production per person against GDP per capita, regardless of year or country. There is a positive association between the two variables, so higher GDP per capita may mean more hydroelectric power per person.

To investigate how this relationship has changed over time, we separated the data points by year and fit a regression for each year. We also plotted each year's slope as a line graph to show how it changed over time. Both can be seen below. The slope is generally positive across years, which indicates a consistent positive relationship between GDP per capita and hydroelectricity output per person. As seen in the regression plot and the plot of the slope over time, the relationship was most strongly positive in the 1960s. However, the slope becomes less and less positive over time.

This means that the association between a country's wealth and its hydropower output per person is weakening over time!
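The per-year regressions described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the rows are invented toy values, and `ols_slope` and `slopes_by_year` are hypothetical helper names, not functions from the original analysis. The slope formula itself is the standard least-squares slope for one predictor.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Least-squares slope of y on x: sum((x-mx)(y-my)) / sum((x-mx)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx  # undefined if every x in the group is identical


def slopes_by_year(rows):
    """Group rows by year and fit one slope per year."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row["year"]]
        xs.append(row["gdp_person"])
        ys.append(row["hydro_person"])
    return {year: ols_slope(xs, ys) for year, (xs, ys) in sorted(groups.items())}


# Hypothetical toy data: the later year is given a flatter relationship.
toy_rows = [
    {"year": 1960, "gdp_person": 1000.0, "hydro_person": 0.02},
    {"year": 1960, "gdp_person": 2000.0, "hydro_person": 0.04},
    {"year": 1960, "gdp_person": 3000.0, "hydro_person": 0.06},
    {"year": 2000, "gdp_person": 1000.0, "hydro_person": 0.02},
    {"year": 2000, "gdp_person": 2000.0, "hydro_person": 0.03},
    {"year": 2000, "gdp_person": 3000.0, "hydro_person": 0.04},
]

yearly_slopes = slopes_by_year(toy_rows)
for year, slope in yearly_slopes.items():
    print(year, slope)
```

Plotting the values of `yearly_slopes` against the year would give a slope-over-time line of the kind described above.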

Fitting a Linear Model Equation

Based on the data and visualizations, we fit an overall linear model where hydroelectric production is the response and GDP is the predictor. The components can be seen in the table below. Of particular interest is the estimate column, highlighted in red. It shows the estimated intercept (from the (Intercept) row) and slope (from the gdp_person row).

term         estimate   std.error  statistic   p.value
(Intercept)  0.0015511  0.0051454   0.3014634  0.7630772
gdp_person   0.0000086  0.0000003  31.4404317  0.0000000

Our final equation is therefore:

\[ (\operatorname{Hydroelectric\ Production\ per\ Person}) = \alpha + \beta_{1}(\operatorname{GDP\ per\ Capita}) + \epsilon \]

Here the estimated intercept is \(\alpha = 0.0015511\), the estimated slope is \(\beta_{1} = 0.0000086\), and \(\epsilon\) is the random error, or “noise.”

This can be interpreted as follows: for every $1 USD increase in GDP per capita, predicted hydroelectric energy output per person increases by about 8.6E-6 tonnes of oil equivalent (toe). A country with a GDP per capita of 0 would have a predicted hydroelectric output of about 0.0016 toe per person.
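As a worked example of this interpretation, the sketch below plugs a GDP per capita value into the fitted line. The coefficients are the estimates from the table above; the input of $10,000 and the function name `predict_hydro` are hypothetical choices made for illustration.

```python
# Estimates taken from the coefficient table in this report.
INTERCEPT = 0.0015511  # toe per person at a GDP per capita of 0
SLOPE = 0.0000086      # toe per person per $1 of GDP per capita


def predict_hydro(gdp_person):
    """Predicted hydroelectric output per person (toe) from the fitted line."""
    return INTERCEPT + SLOPE * gdp_person


# Hypothetical input: a country with a GDP per capita of $10,000.
print(round(predict_hydro(10_000), 4))  # about 0.0876 toe per person

# Two countries $1,000 apart in GDP per capita differ by 1000 * SLOPE.
print(round(predict_hydro(11_000) - predict_hydro(10_000), 4))  # about 0.0086
```

These are point predictions from the fitted line only. They ignore the error term, which the following sections show to be large relative to the predicted values.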

Is the Model a Good Fit?

However, let’s go back to the visualization. The data does not seem very linear overall. It looks like a lot of countries have very low hydroelectric output across a wide range of GDP values. The data seems to “fan” out as GDP increases and doesn’t clearly follow the regression line. As a result, before we can draw the conclusions we made before, we have to assess whether the linear model is actually a good fit based on the appropriate statistical indicators.

Below are the variances of the observed values, the fitted values, and the residuals:

The observed variance is the total variance, split between the fitted and residual variance.

statistic  hydro_person  .fitted    .resid
variance   0.0850756     0.0171156  0.06796

Most of the variance in hydroelectric production is residual variance, which means it is left unexplained by the model. This is not a good sign for the quality of our model.

A good indicator of model fit based on these values is RSquared. It answers the question “What percent of variability in hydroelectric output is explained by GDP per capita?” The closer RSquared is to 1, the better fit the GDP per capita model is, as it explains a higher percentage of the variability. The RSquared can be found in the table below, highlighted in red:

r.squared  adj.r.squared  sigma      statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.2011805  0.200977       0.2607247  988.5007   0        1   -292.1429  590.2859  609.1128  266.8111  3925         3927

The RSquared is about 0.20, so about 20% of the variability in hydroelectric output per person is accounted for by the regression on GDP per capita. The fit is not as strong as many would prefer, but the association is still enough to suggest a connection between richer countries and higher use of hydroelectric power. However, since only 20% of the variability in hydroelectric output per person is explained by the regression, other variables likely have an effect on which countries use more hydroelectric power.
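As a cross-check, the same figure can be recovered from the variance table earlier in this section, since RSquared is the fitted variance divided by the observed variance. The arithmetic below uses only the values reported in that table:

\[ R^{2} = \frac{\operatorname{Var}(\text{fitted})}{\operatorname{Var}(\text{observed})} = \frac{0.0171156}{0.0850756} \approx 0.201, \qquad \frac{\operatorname{Var}(\text{residual})}{\operatorname{Var}(\text{observed})} = \frac{0.06796}{0.0850756} \approx 0.799 \]

The two shares sum to 1, so roughly 80% of the variance in hydroelectric output per person is left in the residuals.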

Assessing Fit with Model Simulation

We have another way that we can assess the fit of the model. If we use our linear model to simulate hydroelectric output data points, and our simulated data looks similar to our observed data, then our regression is a good fit for the data.

We therefore simulated hydroelectric output values based on our model, plotted the distribution of the simulated data, and compared it to the distribution of the observed hydroelectric output.

The distributions are very different. The observed hydroelectric output per person is strongly right-skewed, while the simulated values are normally distributed. This likely means that our prediction model is not a good fit for the observed data.

To confirm this, we plotted the observed data against the simulated data, as seen below. The closer the simulated data is to the observed data, the more closely the points will line up along a line with slope 1. This line appears on the graph as a dashed red line.

The points in our plot of observed versus simulated hydro energy output per person do not appear to follow the line with slope 1. This means that the model does not generate data that is similar to the observed data.
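The simulation step in this section can be sketched as follows. This is a minimal Python illustration, not the team's code. It assumes that each simulated value is the fitted value plus normally distributed noise whose standard deviation is the residual standard error (`sigma`) from the model summary table. The GDP values are hypothetical and `simulate_hydro` is an illustrative name.

```python
import random

# Estimates reported in this report's model tables.
INTERCEPT = 0.0015511
SLOPE = 0.0000086
SIGMA = 0.2607247  # residual standard error ("sigma" in the summary table)


def simulate_hydro(gdp_values, rng):
    """Draw one simulated response per GDP value: fitted value + normal noise."""
    return [INTERCEPT + SLOPE * g + rng.gauss(0.0, SIGMA) for g in gdp_values]


rng = random.Random(331)  # fixed seed so the sketch is reproducible
gdp_values = [500.0, 2000.0, 8000.0, 20000.0, 40000.0]  # hypothetical inputs
simulated = simulate_hydro(gdp_values, rng)
print(simulated)

# With noise this large relative to the fitted values, many draws at low GDP
# fall below zero, which real hydroelectric output cannot do.
low_gdp_draws = simulate_hydro([500.0] * 1000, random.Random(1))
print(sum(v < 0 for v in low_gdp_draws), "of 1000 draws are negative")
```

Under this assumption the simulated values are symmetric around the regression line, which is consistent with the mismatch against the right-skewed observed distribution described above.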

Simulation with Iteration

But what if our simulated values for that one trial just happen to not fit the observed data very well? In order to confirm that this single simulation is not just an unlikely outcome, we simulated this process 1000 times. This means we generated 1000 sets of simulated values and compared each of the simulated sets against the set of the observed data for 1000 comparisons in all.

The plot below shows the distribution of the 1000 RSquared values, one for each simulated set compared against the observed data. The average RSquared across the simulations is between 0.04 and 0.05. This indicates, with a high degree of certainty, that very little of the variation in the observed data can be explained by simulated values from our model. Our model is not a good fit.

We also found the correlation coefficient, or r value, for each simulated comparison. The plot below of the simulated r values confirms that the most likely outcome is a weak positive correlation, or no correlation at all, between the observed values and the values simulated from the regression model.
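The repeated simulation can be sketched as a loop that regenerates the simulated values and records r and RSquared against the observed data each time. This is a minimal Python illustration under stated assumptions: the observed data here is a made-up right-skewed stand-in, `pearson_r` and `simulated_r_values` are hypothetical helper names, and each simulation is assumed to add normal noise to the fitted line. For a regression with a single predictor, RSquared equals the square of r.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulated_r_values(gdp, observed, intercept, slope, sigma,
                       n_sims=1000, seed=331):
    """Simulate n_sims data sets from the line and correlate each with observed."""
    rng = random.Random(seed)
    r_values = []
    for _ in range(n_sims):
        simulated = [intercept + slope * g + rng.gauss(0.0, sigma) for g in gdp]
        r_values.append(pearson_r(simulated, observed))
    return r_values


# Hypothetical stand-in data: evenly spaced GDP values and a right-skewed
# "observed" response. These are NOT the GapMinder values.
toy_rng = random.Random(1)
toy_gdp = [500.0 * i for i in range(1, 81)]
toy_observed = [toy_rng.expovariate(10.0) for _ in toy_gdp]

r_values = simulated_r_values(toy_gdp, toy_observed,
                              intercept=0.0015511, slope=0.0000086,
                              sigma=0.2607247)
r_squareds = [r * r for r in r_values]
print("mean r:", sum(r_values) / len(r_values))
print("mean RSquared:", sum(r_squareds) / len(r_squareds))
```

Histograms of `r_squareds` and `r_values` would correspond to the two distribution plots described in this section. The numbers printed for the toy data say nothing about the real data set.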

Conclusion

From our analysis, we can conclude that although the fitted slope is positive (meaning that higher GDP per capita is associated, on average, with more hydroelectric energy production per person), there is little evidence of a strong linear relationship between hydroelectricity production and gross domestic product. In the scatterplot, the data points show no consistent linear pattern: the majority are clumped toward the bottom of the graph, while a smaller number of higher points pull the regression line upward. The model and simulated RSquareds were very low. From that, we concluded that our model is not a good fit.

There may be other factors to explore that explain hydroelectric production better than just GDP. Exploring the relationship by year and by country may also yield interesting and more significant results.
